arXiv: math.PRxxxx.xxxxx 






The Bayesian Analysis of Complex, 

High-Dimensional Models: 

Can it be CODA? 

Y. Ritov§, P. J. Bickel, A. C. Gamst, B. J. K. Kleijn



Department of Statistics, The Hebrew University, 91905 Jerusalem, Israel;
e-mail: yaacov.ritov@gmail.com; url: http://pluto.inscc.huji.ac.il/~yaacov

Department of Statistics, University of California, Berkeley, CA 94720-3860, USA;
e-mail: bickel@stat.berkeley.edu; url: http://www.stat.berkeley.edu/~bickel

Biostatistics and Bioinformatics, University of California, San Diego, CA 92093-0717,
USA; e-mail: acgamst@math.ucsd.edu; url: http://biostat.ucsd.edu/acgamst.htm

Korteweg-De Vries Instituut voor Wiskunde, POSTBUS 94248, 1090 GE Amsterdam,
Kamer: C4-135; The Netherlands; e-mail: B.J.K.Kleijn@uva.nl; url:
http://home.medewerker.uva.nl/b.j.k.kleijn

Abstract: We consider the Bayesian analysis of a few complex, high-dimensional
models and show that intuitive priors, which are not tailored to the fine details
of the data model and the estimated parameters, are going to fail in situations
in which simple good frequentist estimators exist. The models we consider are:
a partially observed sample, the partial linear model, estimating linear and
quadratic functionals of a white noise model, and estimating with stopping
times. We argue that these findings do not contradict a strong version of Doob's
consistency theorem, which claims that the existence of a uniformly
$\sqrt{n}$-consistent estimator ensures that the Bayes posterior is
$\sqrt{n}$-consistent for values of the parameter with prior probability 1.



Keywords and phrases: Foundations, CODA, Bayesian inference, White noise
models, Partial linear model, Stopping time, Functional estimation,
Semiparametrics.

1. Introduction

We study in this paper a few examples of Bayesian procedures on complex,
high-dimensional parameter spaces. Bayesian procedures can be considered from
different points of view. Their closure is the set of admissible procedures, and
they are known to generate asymptotic minimax procedures in regular parametric
models. These and similar notions are frequentist in nature, and are not the
main focus of the Bayesian paradigm.

The Bayesian procedures we consider are those that adhere to the following 
paradigm. The prior distribution is announced prior to observing the data. If 
this is viewed as too unrealistic, we at least restrict to priors that do not depend 


§ Research supported by an ISF grant.



Ritov, Bickel, Gamst, Kleijn/Can Bayes be CODA ? 2 

on details of the experimental design or on knowing the specific functions of the 
parameters that may turn out to be of interest. In this paradigm, it would not, 
for example, be reasonable for a statistician to use one prior for estimating $h_1(\vartheta)$
and another to estimate $h_2(\vartheta)$, unless $h_1 = h_2$.

We must necessarily approach matters from a "robustness" point of view and 
consider what happens if the parameter has probability 0 under the assumed
prior, as would happen with all reasonable "atom-less" priors on continuous
parameter spaces. That is, we study Bayesian procedures from a frequentist point
of view in the tradition of Bernstein, von Mises and Le Cam, and more recently 
Cox (1993), Diaconis and Freedman (1998), Freedman (1993), Freedman (1999), 
and also of Bayarri and Berger (2004) . 

The extent to which the subjective aspect of data analysis is central to the
modern Bayesian point of view is debatable. See the dialog between Goldstein
(2006) and Berger (2006a) and the discussion of these two papers. However,
central to any Bayesian approach is the posterior distribution and the choice of
prior. Even those who try to reconcile Bayesian and frequentist approaches, cf.
Bayarri and Berger (2004), tend to give, in the case of conflict, a stronger
preference to inferences based on the posterior, rather than frequentist properties,
cf. Berger (2006b).

Most early discussions of Bayesian analysis presented simple examples, e.g.,
$X \sim N(\vartheta, 1)$. In this case, a statistician might have clear a priori ideas about
$\vartheta$, and might well understand the implications of using one prior in place of
another. Regardless, the data will eventually overwhelm the prior, and typically
frequentist and Bayesian inference will coincide. The classical Bernstein-von
Mises Theorem encapsulates this observation, see Le Cam and Yang (1990) or
Lehmann and Casella (1998). Currently, Bayesian procedures are being applied 
to complex, high-dimensional models, e.g., those used in medical imaging. With 
a very high-dimensional parameter space (where laws of large numbers appear, 
"uniform" distributions are concentrated on shells, etc), it is very difficult to 
understand the implications of using a particular prior (in place of another). It 
is very difficult if not impossible to express subjective information about the 
model in a robust prior, and it is difficult to express this knowledge in a way 
that would support the data analysis and not dominate it. This is the situation 
we want to address in the current paper, and to some extent has already been 
considered by Cox (1993), Freedman (1993), and others. 
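
The remark that "uniform" distributions concentrate on shells in high dimensions can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `norm_spread`, the cube $[-1,1]^d$, and the sample sizes are illustrative assumptions. It draws points uniformly from the cube and reports the relative spread (standard deviation divided by mean) of their Euclidean norm, which shrinks as the dimension grows.

```python
import math
import random


def norm_spread(dim, draws=200, seed=0):
    """Relative spread (sd / mean) of the Euclidean norm of points
    drawn uniformly from the cube [-1, 1]^dim (an illustrative choice)."""
    rng = random.Random(seed)
    norms = []
    for _ in range(draws):
        squared = sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(dim))
        norms.append(math.sqrt(squared))
    mean = sum(norms) / draws
    var = sum((r - mean) ** 2 for r in norms) / (draws - 1)
    return math.sqrt(var) / mean


# The relative spread decreases with the dimension: almost all of the
# prior mass sits in a thin shell around a typical radius.
for d in (2, 20, 1000):
    print(d, round(norm_spread(d), 4))
```

In such a regime a prior that looks vague coordinate by coordinate can be very informative about aggregate quantities, which is the difficulty the paragraph above describes.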

Admittedly, there is a body of theory in the area, cf. Ghosal, Ghosh and 
van der Vaart (2000), Kleijn and van der Vaart (2006), and Bickel and Kleijn 
(2012), among others, giving specific conditions under which some finite dimen- 
sional intuition persists in higher dimensions. However, in this paper we empha- 
size how easily these conditions can be violated and the dramatic consequences 
of such violations. 

It is argued "that selection of prior distributions will rarely follow the ide- 
alized scenario of being done without reference to the data or experimental 
structure. After all, models are often selected only after a careful examination 
of the data, so how could a prior on model parameters have been selected be- 
forehand? . . . Bayesian model selection can temper the 'desire' of the data to be 




overfitted, by bringing in prior weights that can be assigned to models" (Berger
(1985), pg. 284). We argue that, for Bayesian modeling to have good frequentist
properties, one has to consider not only aspects of the experimental structure, 
but also the specifics of the parameter of interest, and the fine details of the 
design. In contrast to protection against overfitting, priors may cause underfit- 
ting (i.e., believing what the prior says, even when it is not supported by the 
data). A prior that fits the task, is not a prior. We are reminded of Groucho 
Marx's quote "Those are my principles, and if you don't like them . . . well, I 
have others." In short, we argue that the only way to judge whether a prior is 
good, is to check the behavior of the resulting Bayesian estimator. We do not 
argue that there are no good Bayesian estimators. But we do argue that the
arguments that justify their use cannot be Bayesian. 

We use several examples to illustrate a number of issues. In Section 2, we 
replicate the results of Robins and Ritov (1997) in a missing data problem for 
which a simple (non-Bayesian) estimator is efficient, while the application of any 
prior which adheres to the strict paradigm discussed above forces us to implicitly 
estimate an infinite-dimensional parameter, and leads to an estimator of the
parameter of interest with a slow rate of convergence. This argument is aimed
primarily at estimators which are efficient in the frequentist sense, and failures of
the strict likelihood principle; but if the Bernstein-von Mises phenomenon holds,
the corresponding Bayesian estimators are necessarily efficient. In Section 3 we 
consider the partial linear model of Engle, Granger, Rice and Weiss (1986). 
In this case, if the nonparametric part of the model is smooth enough, the 
Bernstein-von Mises phenomenon holds and Bayesian estimators are efficient, 
under some conditions on the prior, but frequentist estimation can get away 
with significantly less smoothness in the nonparametric regression term. 

Sections 4 and 5 deal with parameters in the Gaussian white noise model,
$X_i = \beta_i + \varepsilon_i$, $i = 1, 2, \ldots$, with $\varepsilon_1, \varepsilon_2, \ldots$ i.i.d. $N(0, 1/n)$. In Section 4 we show that for
any Bayes prior concentrating on sequences of means which decay slowly enough,
there exist linear functionals of the parameters whose posterior distribution
does not converge to the truth at the $n^{-1/2}$ rate, whereas simple frequentist
estimators always do. This phenomenon reflects among other things that it 
is quite possible to have parallel mutually independent sequences which have 
similar temporal correlation and exhibit strong cross-correlation. This somewhat 
surprising situation is discussed in Appendix A. 

This result is coupled with a dramatic example of the failure of a plausible 
prior on images to produce a decent estimate of a linear parameter of the noisy 
image as opposed to a naive frequentist estimator. 

In Section 5 we study a quadratic functional which behaves as the slope does 
in the partial linear model. We argue there that "natural" reference priors only 
work over a limited range. 

In Section 6 we give an example in which Bayesian procedures which ignore 
the stopping time associated with the data generating process fail, while simple 
frequentist procedures continue to work. This demonstrates the danger of the 
classical principle that Bayesians need not pay attention to stopping times. 

Throughout, we argue that the parameter values for which the Bayes procedures
fail are not atypical. A version of Doob's consistency theorem does
however hold. If there exist $\sqrt{n}$-consistent frequentist estimates, failure of
$\sqrt{n}$-consistency of the posterior can only hold on sets of prior probability 0. This is
demonstrated in Section 7, where we also review and discuss our findings. 

2. Continuously Stratified Random Sampling 

Robins and Ritov (1997) consider an infinite-dimensional model of continuously
stratified random sampling in which one has $n$ i.i.d. observations
$W_i = (X_i, R_i, Z_i)$ with $X_i \in [0,1]^d$, $Z_i = R_i Y_i$, and $R_i, Y_i \in \{0,1\}$
conditionally independent given $X_i$, with $g(X) = E(R \mid X)$ known and
$h(X) = E(Y \mid X)$ unknown. The parameter of interest is $\vartheta = E(Y)$. For
discussion of this model see also Wasserman (1998) and Harmeling and Toussaint (2007).

It is relatively easy to construct a reasonable estimate of $\vartheta$. Indeed, the
classical Horvitz-Thompson estimator, cf. Cochran (1977),

$$\hat\vartheta = n^{-1} \sum_{i=1}^{n} Z_i / g(X_i),$$

solves the problem nicely. Because

$$E\{RY/g(X)\} = E\{E(R \mid X)\,E(Y \mid X)/g(X)\} = E\,E(Y \mid X) = \vartheta,$$

the estimator is consistent without any further assumptions. If we assume that
$g$ is bounded from below, the estimator is $\sqrt{n}$-consistent and asymptotically
normal.
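
The unbiasedness argument for the Horvitz-Thompson estimator can be illustrated with a small simulation. This is a hedged sketch under assumptions that are not in the original: the function name `simulate_ht`, the dimension $d = 5$, and the particular choices $g(x) = 0.1 + 0.8x_1$ and $h(x) = 0.2 + 0.6x_1$ (so that $\vartheta = E\,h(X) = 0.5$) are all hypothetical. It contrasts the weighted estimator with the unweighted mean of the observed $Y_i$, which is biased because units with large $h$ are sampled more often.

```python
import random


def simulate_ht(n, d=5, seed=0):
    """Return (Horvitz-Thompson estimate, unweighted complete-case mean).

    Assumed design (illustrative only): X uniform on [0, 1]^d,
    g(x) = 0.1 + 0.8 * x[0] is the known sampling probability,
    h(x) = 0.2 + 0.6 * x[0] is E(Y | X), so theta = E h(X) = 0.5.
    """
    rng = random.Random(seed)
    total_ht = 0.0
    total_observed = 0.0
    n_observed = 0
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        g = 0.1 + 0.8 * x[0]              # known, bounded below by 0.1
        h = 0.2 + 0.6 * x[0]              # unknown to the statistician
        r = 1 if rng.random() < g else 0  # R and Y independent given X
        y = 1 if rng.random() < h else 0
        z = r * y                         # Z = R * Y is what is recorded
        total_ht += z / g                 # inverse-probability weighting
        if r == 1:
            total_observed += y
            n_observed += 1
    return total_ht / n, total_observed / n_observed


ht, complete_case = simulate_ht(20000)
print("Horvitz-Thompson:", round(ht, 3))
print("complete-case mean:", round(complete_case, 3))
```

The weighted estimator never models $h$; it uses only the known $g$, which is exactly the information that the likelihood-based analysis below cannot exploit.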

Consider now a Bayesian analysis of the problem. To simplify the discussion,
assume that the $X_i$ are sampled from an absolutely continuous distribution $F$ on
$[0,1]^d$ with known density $f(x)$. As $f$ and $g$ are both known, the only remaining
parameter is $h$, where $h(X) = E(Y \mid X)$. Let $\pi$ be a prior for $h$ with respect to
some measure $\mu$. The joint density of $h$ and the observations is given by

$$p(h, W) = \pi(h) \prod_{i: R_i = 1} h(X_i)^{Y_i} (1 - h(X_i))^{1 - Y_i}
  \times \prod_{i=1}^{n} g(X_i)^{R_i} (1 - g(X_i))^{1 - R_i},$$

as $Z_i = Y_i$ when $R_i = 1$. But this means that the posterior for $h$ satisfies

$$\pi(h \mid W) \propto \pi(h) \prod_{i: R_i = 1} h(X_i)^{Y_i} (1 - h(X_i))^{1 - Y_i}. \qquad (1)$$

Of course, this is only a function of those observations for which $R_i = 1$, for
which the $Y_i$ are directly observed; that is, the observations for which $R_i = 0$
are deemed uninformative. The difficulty with this restricted point of view is
quite simply that the Bayesian can only make use of the information contained
in (1). However, (1) is independent of $g$. Hence, any procedure which depends
on $g$, for example, the Horvitz-Thompson estimator, cannot be used in this
analysis. The Bayesian is restricted to estimates of $\vartheta$ determined by estimates
of $h$ and, when $d$ is large, estimating $h$ can be very difficult. Indeed, if we
assume that $h$ is Hölder continuous with constants $M < \infty$ and $\alpha > 0$ (i.e.,
$\sup_{t > 0} \sup_{0 < u < 1 - t} t^{-\alpha} |h(u + t) - h(u)| < M$), we need $\alpha > d/2$ for our estimate
of $\vartheta$ to be $\sqrt{n}$-consistent. If $d$ is large, this is a very restrictive assumption.

If much is known about h, for example, there is a finite-dimensional paramet- 
ric model for h, then the Bayesian paradigm runs into no particular difficulty. 
And, as above, similar claims can be made for less restrictive specifications. If, 
however, the problem is nonparametric and we wish to impose only minimal as- 
sumptions on h, the only available estimator is the Horvitz-Thompson estimator 
(or an estimator which is asymptotically very close to it), and such estimators 
are not available to the Bayesian nonparametric statistician. 

Consider now the case in which neither $g$ nor $h$ is known, and these parameters
are assumed a priori to be independent, with joint density $\pi(h)p(g)$ relative to
some dominating measure. In this case, the posterior is the product of a term
depending on $g$ and a term depending on $h$, and information about $g$ cannot
be used by the Bayesian nonparametric statistician to construct estimates of
$\vartheta$. Unfortunately, it can be very difficult to construct reasonable estimates of $g$
when $X$ is high-dimensional.

If, under the prior, we assume $g$ is sufficiently smooth, then the posterior can
be used to obtain consistent estimates of $g$ which in turn yield $\sqrt{n}$-consistent
estimates of $\vartheta$. But the rate at which $g$ can be estimated is only $n^{-\alpha/(2\alpha+d)}$, where
$\alpha$ is a measure of smoothness and $d$ is the dimension. On the other hand, the
frequentist Horvitz-Thompson estimator is efficient over all smoothness classes
for $g$.

Regardless, if g is unknown, h cannot be estimated in general! This is true 
even in the one-dimensional case. Suppose X is distributed uniformly on the 
unit interval and g is given by 

$$g(x) = \frac{1}{2} + \frac{1}{4} \sum_{i=0}^{m-1} s_i \psi(mx - i),$$

where m = m,„ = n^; the sequence si,...,Sm G {—1,1} is assumed to be 
exchangeable with $\sum s_i = 0$, and $\psi(x) = 1(0 < x < 1/2) - 1(1/2 < x < 1)$.
Furthermore, assume that $h(x) = 17/64$ or $h(x) = g(x)$. With probability
converging to 1, there will be no interval of length $1/m$ with more than one $X_i$.
However, given that there is one $X_i \in (j/m, (j+1)/m)$, the distribution
of $(R_i, Z_i)$ is the same whether $h(x) = 17/64$ or $h(x) = g(x)$, and hence $\vartheta$ is not
identifiable, and can be either 17/64 or 1/2.

Of course, in general, it is possible to trade smoothness in g for a lack of 
smoothness in $h$ and vice versa, to construct estimates of $\vartheta$, but smoothness
assumptions in high dimensions tend to be restrictive. 




Note that this argument shows that any estimator which is based only on the 
likelihood function, ignoring auxiliary information which is not part of it or the
parameter space, fails in this setup. In particular, this includes the maximum 
likelihood estimator. 

However, we can construct a $\sqrt{n}$-consistent Bayesian estimator if it is based
on the a priori unknown g. An example is given in Appendix B. 

This argument mixes Bayesian and non-Bayesian techniques. Our goal is 
to make the argument precise and to study its impact on understanding the 
meaning of Bayesian inference in complex, high-dimensional models. 

3. The Partial Linear Model 

In this section we consider the partial linear model, also known as the partial
spline model, and originally discussed in Engle et al. (1986); see also Schick
(1986). In this case, we have observations $W_i = (U_i, X_i, Y_i)$ such that

$$Y_i = \vartheta X_i + g(U_i) + \varepsilon_i,$$

where $(X_i, U_i)$ are i.i.d. samples from the joint density $p(x, u)$, relative to Lebesgue
measure on the unit square, $[0,1]^2$; $g$ is an element of some class of functions, $\mathcal{G}$;
and the $\varepsilon_i$ are i.i.d. $N(0, 1)$. The parameter of interest is $\vartheta$ and $g$ is a (possibly
very non-smooth) nuisance parameter. Let $h(U) = E(X \mid U)$. For simplicity,
assume that $U$ is known to be uniformly distributed on the unit interval.

3.1. A Frequentist Analysis 

The loglikelihood function is

$$\ell(\vartheta, g, p) = -\frac{(y - \vartheta x - g(u))^2}{2} + \log p(x, u).$$

It is straightforward to argue that the score function for $\vartheta$ (the directional
derivative of the log-likelihood in the least favorable direction for estimating $\vartheta$)
is given by (cf. Schick (1986); Bickel, Klaassen, Ritov and Wellner (1998))

$$\tilde\ell_\vartheta(\vartheta, g) = (x - h(u))(y - \vartheta x - g(u)) = (x - h(u))\,\varepsilon,$$

and the semiparametric information bound for the estimation of $\vartheta$ is

$$I = E\,\mathrm{Var}(X \mid U).$$

We assume that $I > 0$. In particular, this implies that $X$ is not a function of $U$.
Under some regularity conditions an efficient estimator can be constructed
along the following lines. Find initial estimators $\tilde h$ and $\tilde g$ of $h$ and $g$ respectively,
and estimate $\vartheta$ by computing

$$\hat\vartheta = \frac{\sum_i (X_i - \tilde h(U_i))(Y_i - \tilde g(U_i))}{\sum_i (X_i - \tilde h(U_i))^2}.$$

The idea is that $\vartheta$ is the covariance between $X$ and $Y$ conditional on any given
value of $U$, and this estimator is based on the assumptions that the conditional
expectations of $X$ and $Y$ given $U$ are smooth enough and have fair estimators
$\tilde h$ and $\tilde g$, respectively.

We could, for example, assume that the functions $g$ and $h$ satisfy Hölder
conditions of order $\alpha$ and $\gamma$, respectively. That is, there is $C < \infty$ such
that $|g(v) - g(u)| < C|v - u|^{\alpha}$ and $|h(v) - h(u)| < C|v - u|^{\gamma}$ for all $v, u$ in
the support of $U$. We could also assume that $\mathrm{Var}(X \mid U)$ has a version which is
continuous in $u$. In this case, so long as $\alpha + \gamma > 1/2$ and $I > 0$, we can construct
a $\sqrt{n}$-consistent and semiparametrically efficient estimate of $\vartheta$.

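An estimator that tries to push the smoothness assumptions to the absolutely
weakest necessary is the following:

Let $W_{(i)} = (U_{(i)}, X_{(i)}, Y_{(i)})$, $i = 1, 2, \ldots, n$, be the sample ordered such that
$U_{(1)} < U_{(2)} < \cdots$. Write $X = h(U) + Z$. Note that
$\sum_i (U_{(i+1)} - U_{(i)})^2 = O_p(n^{-1})$ under the assumption that the $U_i$ are
uniformly distributed on $[0, 1]$, while $n^{-1} \sum_i (X_{(i+1)} - X_{(i)})^2 \to c > 0$
under the assumption that $I > 0$. Take

$$\hat\vartheta = \frac{\sum_i (X_{(i+1)} - X_{(i)})(Y_{(i+1)} - Y_{(i)})}{\sum_i (X_{(i+1)} - X_{(i)})^2}
  = \vartheta + R + \frac{\sum_i (X_{(i+1)} - X_{(i)})(\varepsilon_{(i+1)} - \varepsilon_{(i)})}{\sum_i (X_{(i+1)} - X_{(i)})^2},$$

where

$$R = \frac{\sum_i (X_{(i+1)} - X_{(i)})\,(g(U_{(i+1)}) - g(U_{(i)}))}{\sum_i (X_{(i+1)} - X_{(i)})^2}
  \approx \frac{\sum_i (h(U_{(i+1)}) - h(U_{(i)}))\,(g(U_{(i+1)}) - g(U_{(i)}))}{\sum_i (X_{(i+1)} - X_{(i)})^2}
  = O_p\big(n^{-(\alpha + \gamma)}\big),$$

because the $Z_i$ are uncorrelated with the $U_i$. On the other hand,

$$\sqrt{n}\, \frac{\sum_i (X_{(i+1)} - X_{(i)})(\varepsilon_{(i+1)} - \varepsilon_{(i)})}{\sum_i (X_{(i+1)} - X_{(i)})^2}
  \to_D N\Big(0, \frac{3}{2I}\Big).$$

We conclude that $\vartheta$ can be estimated at a $\sqrt{n}$ rate if $h$ is Lipschitz of order
$\gamma$, $g$ is Lipschitz of order $\alpha$, and $\alpha + \gamma > 1/2$. This estimator is not efficient, but
it does show what the minimal local smoothness conditions are. We want to
remark that not all pairs of observations are needed; a subset of size of order $n$
may suffice.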
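
The estimator based on differences of neighboring observations (after sorting by $U$) is easy to try out. The sketch below is illustrative only: the rough function `rough`, the choices $h(u) = \sqrt{|\sin 40u|}$ and $g(u) = 3\sqrt{|\sin 40u|}$ (both Hölder of order 1/2), the value $\vartheta = 2$, and all function names are assumptions made for the example. It also computes the ordinary least squares slope of $Y$ on $X$ that ignores $g$, which is biased here because $g(U)$ and $h(U)$ are correlated.

```python
import math
import random


def rough(u):
    """An illustrative non-smooth function, Holder of order 1/2."""
    return math.sqrt(abs(math.sin(40.0 * u)))


def simulate(n, theta=2.0, seed=0):
    """Draw (U, X, Y) with X = h(U) + Z and Y = theta * X + g(U) + eps."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.random()
        x = rough(u) + rng.gauss(0.0, 1.0)                     # h = rough
        y = theta * x + 3.0 * rough(u) + rng.gauss(0.0, 1.0)   # g = 3 * rough
        data.append((u, x, y))
    return data


def difference_estimator(data):
    """Ratio of summed products of successive differences, ordered by U."""
    ordered = sorted(data)  # tuples sort by their first entry, U
    num = 0.0
    den = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(ordered, ordered[1:]):
        dx = x1 - x0
        num += dx * (y1 - y0)
        den += dx * dx
    return num / den


def naive_ols(data):
    """Least squares slope of Y on X, ignoring the nuisance function g."""
    n = len(data)
    mean_x = sum(x for _, x, _ in data) / n
    mean_y = sum(y for _, _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for _, x, y in data)
    sxx = sum((x - mean_x) ** 2 for _, x, _ in data)
    return sxy / sxx


sample = simulate(5000)
print("difference estimator:", round(difference_estimator(sample), 3))
print("naive least squares: ", round(naive_ols(sample), 3))
```

Differencing neighbors removes most of $g(U)$ without ever estimating it, which is why only local smoothness of $g$ and $h$ enters the argument.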




For the sake of completeness, we construct an efficient estimator. Let $c_n \to 0$
slowly, and choose three random sub-samples of size $c_n n$. Based on the first
sub-sample construct an initial estimate of $\vartheta$; estimate $g$ using the second; and
$h$ using the third. Denote these estimates $\tilde\vartheta$, $\tilde g$, and $\tilde h$, respectively. The
nonparametric components $g$ and $h$ are identified as $E(Y - \vartheta X \mid U)$ and $E(X \mid U)$.
Both can be estimated using kernels (with $\tilde\vartheta$ plugged in for $\vartheta$) with bandwidth
$\sigma_n \to 0$ slowly enough that $c_n \sigma_n n \to \infty$. Then $E(\tilde g - g)^2 = O(n^{-2\alpha + \nu})$,
$E(\tilde h - h)^2 = O(n^{-2\gamma + \nu})$, with $\nu$ arbitrarily small. Let $S$ be the remainder of the
sample. Calculate

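$$\hat\vartheta = \frac{\sum_{i \in S} (X_i - \tilde h(U_i))(Y_i - \tilde g(U_i))}{\sum_{i \in S} (X_i - \tilde h(U_i))^2}
  = \frac{\sum_{i \in S} (X_i - \tilde h(U_i))(\vartheta X_i + g(U_i) + \varepsilon_i - \tilde g(U_i))}{\sum_{i \in S} (X_i - \tilde h(U_i))^2}$$

$$= \vartheta + \frac{\sum_{i \in S} (X_i - \tilde h(U_i))\,\varepsilon_i}{\sum_{i \in S} (X_i - \tilde h(U_i))^2}
  + o_p(n^{-1/2}) + O_p\Big( n^{-1} \sum_{i \in S} (h(U_i) - \tilde h(U_i))(g(U_i) - \tilde g(U_i)) \Big),$$

since the initial estimators are independent and independent of $S$. We conclude
that $\hat\vartheta = \vartheta + (nI)^{-1} \sum_i (X_i - h(U_i))\,\varepsilon_i + o_p(n^{-1/2})$, as required.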

3.2. Minimal Smoothness 

Consider the sub-model where $(\vartheta, g, h) \in \mathbb{R} \times A_m \times A_m$, with

$$A_m = \Big\{ h : h(u) = \sum_i c_i (u_{i+1} - u_i)^{\alpha}\, \psi\Big(\frac{u - u_i}{u_{i+1} - u_i}\Big),\ u \in (0, 1),\
  0 = u_0 < u_1 < \cdots < u_{m+1} = 1,\ c_i \in \mathbb{R},\ \max_i |c_i| < M \Big\},$$

where

$$\psi(t) = t\, 1_{[0, 1/2)}(t) - (1 - t)\, 1_{[1/2, 1]}(t).$$

Clearly, if $(g, h) \in A_m \times A_m$, they are Hölder of order $\alpha$. Suppose
$n \max_i \{u_{i+1} - u_i\} \to 0$; then the constants $c_1, \ldots, c_m$ that define $g$ and $h$ cannot be estimated
consistently, and hence, if the $u_i$ behave like a size $m$ random sample
from a uniform distribution, $\|\tilde h - h\| = O_p(m^{-\alpha})$.

Consider now a semi-Bayesian version of the problem, where $m \sim n^{1+\nu}$,
$\nu > 0$, $u_1, \ldots, u_m$ are the order statistics of a sample of i.i.d. $U(0, 1)$, $c_1, \ldots, c_m$
are independent, the $c_i$ in the definition of $A_m$ are $N(0, \tau^2)$ for $h$ and $N(0, \eta^2)$
for $g$, with correlation $\rho$ between the two, and $Z = X - h(U)$ is a $N(0, v^2)$
random variable. Finally, $\varepsilon_i \sim N(0, \sigma^2)$. Then we observe pairs

$$X_i = h_i + Z_i, \qquad Y_i = \vartheta (h_i + Z_i) + g_i + \varepsilon_i.$$

Hence they follow a bivariate normal distribution with covariance matrix:

$$\begin{pmatrix}
v^2 + \tau^2 & \vartheta(v^2 + \tau^2) + \rho\tau\eta \\
\vartheta(v^2 + \tau^2) + \rho\tau\eta & \sigma^2 + \eta^2 + \vartheta^2(v^2 + \tau^2) + 2\vartheta\rho\tau\eta
\end{pmatrix}.$$

Any estimator of $\vartheta$ is equivalent to solving the empirical covariance
matrix (which yields 3 equations) for the 6 parameters $(\vartheta, v^2, \sigma^2, \rho, \tau, \eta)$. If $\alpha < 1/4$
then $\tau^2, \eta^2 \gg n^{-1/2}$, hence the expression $\rho\tau\eta$ cannot be ignored, and $\vartheta$
cannot be solved to the $n^{-1/2}$ accuracy.
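
The counting argument (3 equations, 6 unknowns) can be made concrete with two parameter settings that produce the same covariance matrix but different slopes. This is a small worked check, not a result from the paper: the function name `covariance` and the two numerical parameter sets are hypothetical, chosen so that the three entries coincide while the slope is 1.0 in one case and 0.8 in the other.

```python
import math


def covariance(theta, v2, tau2, eta2, rho, sigma2):
    """Entries (Var X, Cov(X, Y), Var Y) of the bivariate normal law of
    X = h + Z and Y = theta * (h + Z) + g + eps, where Var h = tau2,
    Var g = eta2, Corr(h, g) = rho, Var Z = v2 and Var eps = sigma2."""
    tau = math.sqrt(tau2)
    eta = math.sqrt(eta2)
    var_x = v2 + tau2
    cov_xy = theta * var_x + rho * tau * eta
    var_y = sigma2 + eta2 + theta ** 2 * var_x + 2.0 * theta * rho * tau * eta
    return var_x, cov_xy, var_y


# Two illustrative parameter settings with different slopes theta.
setting_a = covariance(theta=1.0, v2=1.0, tau2=1.0, eta2=1.0, rho=0.0, sigma2=1.0)
setting_b = covariance(theta=0.8, v2=1.0, tau2=1.0, eta2=1.0, rho=0.4, sigma2=1.08)

print(setting_a)
print(setting_b)
```

Both settings give the same three second moments, so second moments alone cannot separate the slope from the cross term $\rho\tau\eta$ between the two rough components.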

3.3. A Bayesian Analysis 

We want to consider a Bayesian approach. Suppose that the Bayesian has
independent priors on $p(u, x)$, $g$ and $\vartheta$, $\pi = \pi_p \times \pi_g \times \pi_\vartheta$. For example, the
first distribution may be a function of the environment, the prior on the
nonparametric component of the regression function is a function of the physical
process, and the third component of the prior is about our understanding of the
measurement engineering. The log-posterior is then

$$-\frac{1}{2} \sum_{i=1}^{n} (Y_i - \vartheta X_i - g(U_i))^2 + \log\pi_\vartheta(\vartheta) + \log\pi_g(g)
  + \sum_{i=1}^{n} \log p(U_i, X_i) + \log\pi_p(p).$$

That is, when the Bayesian comes to estimate $\vartheta$, he does not see any information
about $h$. The same estimator would be used whatever is known about the
smoothness of $h$!

Suppose now that essentially it is only known that $g$ is Hölder of order $\alpha$,
while the range of $U$ is divided into intervals, such that on each of them $h$ is
either Hölder of order $\gamma_1$ or of order $\gamma_2$, where

$$\alpha + \gamma_1 < \frac{1}{2} < \alpha + \gamma_2.$$

Then a $\sqrt{n}$-consistent estimator should only use the intervals where $h$ is Hölder
of order $\gamma_2$. The rest should be discarded. If the number of observations
in the "good" intervals is of the same order as $n$, then the estimator is still
$\sqrt{n}$-consistent. For a frequentist, there is no difficulty in ignoring the nuisance
intervals: $\vartheta$ is assumed to be the same all over. However, the Bayesian cannot
ignore these intervals. In fact, his a posteriori distribution does not contain any
information about which intervals are good and which are bad.

There is no logical contradiction. The type of parameter combination on which
the Bayes estimator fails has negligible a priori probability. He assumes
that a priori $g$ and $h$ are independent, and short intervals are essentially
independent (to take care of the very rough $g$ and $h$ we deal with). Under these
assumptions, the intervals with $h$ Hölder of order $\gamma_1$ contribute on the average
0. But this average is by the prior, which was conveniently constructed by the
Bayesian.

4. The white-noise-model and the plug-in property of Bayesian 
estimate 

We consider the white noise model:

$$X_i = \beta_i + \varepsilon_i,$$

where $\beta = (\beta_1, \beta_2, \ldots) \in B \subset \ell_2$, and $\varepsilon_1, \varepsilon_2, \ldots$ are i.i.d. $N(0, 1/n)$. This model
is called the white noise model because of its equivalence to the model
$dX(t) = \mu(t)\,dt + n^{-1/2}\,dW(t)$, $t \in [0, 1]$, $\mu \in L_2$, and $W$ a standard Wiener process, by
taking $X_1, X_2, \ldots$ and $\beta_1, \beta_2, \ldots$ to be the projections of $X(\cdot)$ and $\mu(\cdot)$ on some
orthonormal basis of $L_2(0, 1)$. We consider the estimation of $\beta$ as an object of $\ell_2$
with a squared norm loss function $\|\hat\beta - \beta\|^2$, and estimation of a linear functional
of $\beta$, $h(\beta) = \sum_{i=1}^{\infty} c_i \beta_i$, $h \in \mathcal{H} = \{h(\beta) = \sum_{i=1}^{\infty} c_i \beta_i : (c_1, c_2, \ldots) \in \ell_2\}$, again
under the squared error loss function.

To be more specific we consider $B_\alpha = \{\beta : |\beta_i| \le i^{-\alpha}\}$, $\alpha > 1/2$. From
a standard frequentist point of view the estimation in this problem is simple
enough. Simple estimators that achieve the optimal rate of convergence are
given in the following proposition:

Proposition 4.1 The estimator $\hat h = \sum_i c_i X_i$ is $\sqrt{n}$-consistent for any $h \in \mathcal{H}$.
The estimator

$$\hat\beta_i = \begin{cases} X_i, & i^{\alpha} \le n^{1/2}, \\ 0, & i^{\alpha} > n^{1/2}, \end{cases}$$

achieves the minimax rate of convergence, $n^{-(2\alpha - 1)/2\alpha}$.

The proof is in Appendix C. 
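
The two frequentist estimators of the proposition can be tried on simulated data. The following is a sketch under stated assumptions: the sequence is truncated to a finite number of coordinates (`size`), and the particular $\beta_i = 0.5\, i^{-\alpha}$, the coefficients $c_i \propto i^{-0.6}$ (normalized so that $\sum c_i^2 = 1$), and all function names are illustrative choices rather than anything prescribed by the paper.

```python
import math
import random


def simulate_white_noise(beta, n, rng):
    """Draw X_i = beta_i + eps_i with eps_i i.i.d. N(0, 1/n)."""
    sd = 1.0 / math.sqrt(n)
    return [b + rng.gauss(0.0, sd) for b in beta]


def truncated_estimate(x, n, alpha):
    """Keep X_i while i^alpha <= sqrt(n); set the remaining coordinates to 0."""
    return [xi if i ** alpha <= math.sqrt(n) else 0.0
            for i, xi in enumerate(x, start=1)]


def linear_functional(c, v):
    return sum(ci * vi for ci, vi in zip(c, v))


def experiment(n=1000, size=500, alpha=1.0, reps=400, seed=0):
    rng = random.Random(seed)
    beta = [0.5 * i ** (-alpha) for i in range(1, size + 1)]
    raw = [i ** (-0.6) for i in range(1, size + 1)]
    norm = math.sqrt(sum(r * r for r in raw))
    c = [r / norm for r in raw]                  # sum of c_i^2 equals 1
    h_true = linear_functional(c, beta)
    scaled_errors = []
    loss_truncated = 0.0
    loss_raw = 0.0
    for _ in range(reps):
        x = simulate_white_noise(beta, n, rng)
        # sqrt(n) * (h_hat - h) is exactly N(0, 1) when sum c_i^2 = 1
        scaled_errors.append(math.sqrt(n) * (linear_functional(c, x) - h_true))
        b_hat = truncated_estimate(x, n, alpha)
        loss_truncated += sum((a - b) ** 2 for a, b in zip(b_hat, beta)) / reps
        loss_raw += sum((a - b) ** 2 for a, b in zip(x, beta)) / reps
    return scaled_errors, loss_truncated, loss_raw


errors, loss_truncated, loss_raw = experiment()
mean_error = sum(errors) / len(errors)
print("mean of sqrt(n) * (h_hat - h):", round(mean_error, 3))
print("squared-norm loss, truncated vs raw:", round(loss_truncated, 4), round(loss_raw, 4))
```

The point of the sketch is that the functional is estimated from the raw observations, not from the truncated $\hat\beta$: the good estimator of $h(\beta)$ is not a plug-in of the good estimator of $\beta$.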

A major characterization of the Bayes procedures is that they necessarily have
the plug-in property (PIP). Since

$$E\{h(\beta) \mid X\} = \sum_i c_i\, E(\beta_i \mid X),$$

we have $\widehat{h(\beta)} = h(\hat\beta)$, for any Bayes estimators $\widehat{h(\beta)}$ and $\hat\beta$ of $h(\beta)$ and $\beta$, respectively,
both under quadratic loss function.

However, there is no efficient estimator with PIP in the white noise model,
as is shown in Bickel and Ritov (2003). Every estimator would fail either as a
nonparametric estimator with an optimal rate, or as a plug-in-estimator (PIE)
of at least one linear functional. The argument of Bickel and Ritov (2003), being
valid for any estimator, is not strong enough for our purpose. However, we can
strengthen it for Bayes procedures.

We need the following lemma, whose proof is given in Appendix C.

Lemma 4.2 Suppose $X \sim N(\vartheta, \sigma^2)$, $|\vartheta| \le a < \sigma$. Let $\delta = \delta(X)$ be the a
posteriori mean when the prior is $\pi$. Let $b_\vartheta$ be its bias under $\vartheta$. Then
$|b_\vartheta| + |b_{-\vartheta}| \ge 2(1 - (a/\sigma)^2)|\vartheta|$. In particular, if $\pi$ is symmetric around 0, then
$|b_\vartheta| \ge (1 - (a/\sigma)^2)|\vartheta|$.

This lemma shows that any Bayes estimator is necessarily biased and puts 
a lower bound on this bias. We will use this lemma to argue that any Bayes 
estimator is going to fail for some simple functionals. 
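
The bias bound can be checked numerically for one concrete symmetric prior. The sketch below is only an illustration for a single prior chosen here, not a proof of the lemma: it assumes the two-point prior that puts mass 1/2 on each of $\pm a$, for which the posterior mean is $a \tanh(aX/\sigma^2)$, and it evaluates the bias by the trapezoid rule. The function names, the grid, and the values $a = 0.5$, $\sigma = 1$ are assumptions of the example.

```python
import math


def posterior_mean(x, a, sigma):
    """Posterior mean of the parameter given X = x when the prior puts
    mass 1/2 on each of +a and -a and X ~ N(parameter, sigma^2)."""
    return a * math.tanh(a * x / sigma ** 2)


def bias(theta, a, sigma, grid=4001):
    """E[posterior_mean(theta + sigma * Z)] - theta for Z ~ N(0, 1),
    computed with the trapezoid rule on [-8, 8]."""
    lo, hi = -8.0, 8.0
    step = (hi - lo) / (grid - 1)
    total = 0.0
    for k in range(grid):
        z = lo + k * step
        weight = step * (0.5 if k in (0, grid - 1) else 1.0)
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * density * posterior_mean(theta + sigma * z, a, sigma)
    return total - theta


a, sigma = 0.5, 1.0
for theta in (0.1, 0.2, 0.3, 0.4, 0.5):
    lower_bound = (1.0 - (a / sigma) ** 2) * abs(theta)
    print(theta, round(abs(bias(theta, a, sigma)), 4), ">=", round(lower_bound, 4))
```

When the noise level $\sigma$ is large relative to the prior range $a$, the posterior mean shrinks strongly toward 0, and the shrinkage shows up as a bias proportional to the true value.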

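Theorem 4.3 For any Bayesian estimator $\hat\beta$ with respect to a prior on $B_\alpha$, $\alpha > 1/2$,
there are $h \in \mathcal{H}$ and $\beta \in B_\alpha$ such that $n\,(h(\hat\beta) - h(\beta)) \to \infty$. In fact,
$E_\beta h(\hat\beta) - h(\beta) = O(n^{-(2\alpha - 1)/4\alpha})$.

Proof. It follows from Lemma 4.2 that for any $i > 2n^{1/2\alpha}$ there is $\beta_i$ such that
if $b_i = E\hat\beta_i - \beta_i$ then $|b_i| > 3i^{-\alpha}/4$. Define

$$c_i = \begin{cases}
0, & i \le 2n^{1/2\alpha}, \\
C_1 n^{(2\alpha - 1)/4\alpha}\, i^{-\alpha}, & i > 2n^{1/2\alpha} \ \&\ b_i > i^{-\alpha}/2, \\
-C_1 n^{(2\alpha - 1)/4\alpha}\, i^{-\alpha}, & i > 2n^{1/2\alpha} \ \&\ b_i < -i^{-\alpha}/2,
\end{cases}$$

where $C_1$ ensures that $\sum_i c_i^2 = 1$ (note that $C_1$ is bounded away from 0 and
$\infty$). Hence

$$E \sum_i c_i (\hat\beta_i - \beta_i) \ge \frac{1}{2}\, C_1 n^{(2\alpha - 1)/4\alpha} \sum_{i > 2n^{1/2\alpha}} i^{-2\alpha}. \qquad \Box$$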

That is, any Bayesian estimator fails on some pairs $\beta$ and $h$. These pairs are
not strange animals. Actually they are pretty 'typical' members of $B_\alpha$ and $\mathcal{H}$.
What makes them special is only that the sequence $\beta_1, \beta_2, \ldots$ is not ergodic,
and similarly the coefficient sequence $c_1, c_2, \ldots$ of $h$ is not. Each of them has a
non-trivial auto-correlation function, and the two auto-correlation functions are
similar. The prior makes such pairs unlikely, and the biases of the estimator of
each of the components are going to cancel each other by the prior. If the a
priori distribution represents a real physical phenomenon, this exact cancellation,
due to the law of large numbers, is reasonable, and the statistician should not
worry about it. If the prior is a way to express ignorance, or beliefs (subjective
beliefs), then one should worry about these small biases. This is certainly so if
the only reason to assume that small terms are not going to accumulate is the
mathematical convenience of expressing rough ideas about the unknown parameters.




The parameter /3 and the functional may be similar because of phenomena 
such as the one presented in Appendix A. In a large space, the autocorrelation 
function may be complex with an unknown neighborhood structure, and in 
practice, completely hidden from the observer. 

We consider a Bayesian model to be honestly nonparametric on $B_\alpha$ if $\mathcal{L}(\beta_i \mid X_{-i})$
is symmetric around 0, and $P(\beta_i > \gamma i^{-\alpha} \mid X_{-i}) > \gamma$, for some $\gamma > 0$, where
$X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots)$. That is, at least in some sense, the components
of $\beta$ are free parameters. We have:

Theorem 4.4 Suppose the prior is honestly non-parametric on $B_\alpha$ and $1/2 < \alpha < 3/4$,
then the Bayesian estimator of $h(\beta) = \sum_{i=1}^{\infty} c_i \beta_i$ is not consistent, if
$|c_i| = i^{-\alpha}$, $\beta$ and $h$ are serially correlated with correlation bounded away from 0,
and V'^Si^m A^ ~^ ^^ (which is the a.s. the case under the prior). 

Proof. Again, we consider the bias as in the second part of Lemma 4.2: 
VnE V c,(A-ft)= V d,c,/3„ \d,-l\<n-r 



j>„v+l/2a j>„v+l/2a 



n 



4.1. An example

Here is a simple simulation. For the vector $\beta$ we considered the image given
in Figure 1(a): the figure is a gray scale image of a 367 x 300 matrix, whose
vectorization is the vector $\beta \in \mathbb{R}^{110100}$. That is, if the gray level of the image
is represented by $I_{ij}$, then $\beta = I = (I_{1,1}, I_{2,1}, \ldots, I_{367,1}, I_{1,2}, \ldots, I_{367,300})$. To
obtain $X$ we added to each pixel an independent $N(0, 169)$ random variable.
See Figure 1(b). We emphasize that we do not consider $\beta$ and $X$ as images,
but as vectors with exchangeable components. The Bayes estimator was calculated
with respect to a prior which considers the components as i.i.d. $N(\mu, \tau^2)$,
where $\mu = \sum w_i \beta_i / \sum w_i$, $w_1, w_2, \ldots$ are i.i.d. $U(0, 1)$ random variables and
$\tau^2 = 315.786$ is the true empirical variance of $\beta_1, \ldots, \beta_{110100}$. The resulting SNR
is low (-2.72 dB). The nonparametric Bayes estimator of $\beta$ is given in Figure
1(c). It is closer to the true image than the noisy observations, as expected,
since the prior is an honest exchangeable description of the data.

The purpose of the estimation was not the nonparametric estimate per se, but,
in the spirit of this section, an estimation of a functional. The functional $h$ is given
in Figure 1(d). Again, the image is a gray scale representation of $c_1, \ldots, c_{110100}$.
The two images were selected from the small collection of images supplied with
the standard distribution of Matlab, and this pair was selected because their
sizes fitted. Thus we have two processes on the unit square. One represents $\{\beta_i\}$,
while each pixel in the second image represents the value of $c_i$. In both cases, the
image is an image of an object, and therefore has a center and margins. There
is a strong correlation between the points of each picture, beyond its being continuous.
The vector of parameters and the functional are correlated, both being
images that follow the same rules of a good image, rules that are not necessarily
known to the data analyst.

Applying h to the noisy X yields an estimate with RMSE (root mean squared 
error) of 1.04. Applying h to the much cleaner Bayes estimator gives RMSE of 
19.01. The main difference between these two estimators was in the bias (0.01 
versus 19.00). The bias and RMSE calculation were based on 500 Monte Carlo 
simulations. 
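
The mechanism behind these numbers can be reproduced on synthetic data. The sketch below does not use the Matlab images and does not attempt to reproduce the reported RMSE values; it is a one-dimensional stand-in under assumptions made here: both the "object" $\beta$ and the functional coefficients $c$ have a bright center and dark margins, the prior treats the components as i.i.d. $N(\mu, \tau^2)$ with $\mu$ and $\tau^2$ set to the empirical mean and variance of $\beta$, and the noise standard deviation is 13 as in the text. All names are hypothetical.

```python
import math
import random


def experiment(size=400, noise_sd=13.0, reps=200, seed=0):
    rng = random.Random(seed)
    angle = [2.0 * math.pi * (i - size / 2) / size for i in range(size)]
    beta = [100.0 + 40.0 * math.cos(a) for a in angle]  # bright center, dark margins
    raw = [1.0 + math.cos(a) for a in angle]            # second "image", same layout
    norm = math.sqrt(sum(r * r for r in raw))
    c = [r / norm for r in raw]
    mu = sum(beta) / size
    tau2 = sum((b - mu) ** 2 for b in beta) / size
    k = tau2 / (tau2 + noise_sd ** 2)                   # posterior shrinkage factor
    h_true = sum(ci * bi for ci, bi in zip(c, beta))
    se_naive = se_bayes = mse_x = mse_post = 0.0
    for _ in range(reps):
        x = [b + rng.gauss(0.0, noise_sd) for b in beta]
        post = [mu + k * (xi - mu) for xi in x]         # posterior mean of each beta_i
        se_naive += (sum(ci * xi for ci, xi in zip(c, x)) - h_true) ** 2
        se_bayes += (sum(ci * pi_ for ci, pi_ in zip(c, post)) - h_true) ** 2
        mse_x += sum((xi - b) ** 2 for xi, b in zip(x, beta)) / size
        mse_post += sum((pi_ - b) ** 2 for pi_, b in zip(post, beta)) / size
    return {
        "rmse_naive": math.sqrt(se_naive / reps),
        "rmse_bayes": math.sqrt(se_bayes / reps),
        "mse_per_pixel_noisy": mse_x / reps,
        "mse_per_pixel_bayes": mse_post / reps,
    }


for name, value in experiment().items():
    print(name, round(value, 2))
```

In this toy version the posterior mean is better pixel by pixel, yet applying the functional to it is much worse than applying it to the raw data, because the small per-pixel shrinkage biases all line up with the coefficients.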

The Bayes estimator does not fail because the object of interest, $\beta$, and
the functional are not independent. They are independent. There is no reason
to assume that the bone structure of the image representing the functional has
anything to do with the jet imaged in the object. It did not fail because no image
analysis tools were used. Smoothness of the picture is far from being relevant to
the failure. It failed because the prior failed to recognize that the images are
not permutation invariant or ergodic, and hence two images may be correlated,
positively or negatively by chance, but correlated. See Appendix A. In fact, this
is typically the case with two good pictures. A good picture has a structure. It
has a center and it has margins; it is not a mixing process. Now, with pictures
this is easy to understand in retrospect. It is not that easy to understand a priori.
But pictures are two dimensional and at least can be viewed. With complicated
graphs, which human beings cannot view and understand, but which may be
non-mixing and with clear (though not well understood) structure, the same situation
could happen: bias would be introduced into the Bayes procedure, but the Bayesian
may fail to understand this, and the prior that expresses his subjective belief on the
subject would fail to protect him against bias.



5. Estimating the signal squared, and the importance of being 
unbiased 

We continue with the analysis of the white noise model of Section 4, but we
consider a different Euclidean parameter of interest: $\vartheta = \sum_{i=1}^{\infty} \beta_i^2$.

A natural estimator of $\beta_i$ is given by Proposition 4.1, and one may consider
as an estimator of the parameter $\hat\vartheta = \sum_i \hat\beta_i^2 = \sum_{i < n^{1/2\alpha}} X_i^2$. This works fine
when $\alpha > 1$. It achieves both the minimax rate for estimating $\beta$, and $\hat\vartheta$ is an
efficient estimator of the Euclidean parameter. But $\hat\beta_i^2$ has a bias of $n^{-1}$ as
an estimator of $\beta_i^2$, which accumulates to $n^{-1+1/2\alpha} \gg n^{-1/2}$ when $\alpha < 1$. The
simple traditional correction is to unbias the estimator, cf. Bickel and Ritov
(1989):

Proposition 5.1 Suppose $\alpha \in (3/4, 1)$, then an efficient estimator of $\vartheta$ is given
by

$$\hat\vartheta = \sum_{i \le m} \bigl(X_i^2 - n^{-1}\bigr)$$

for $n^{1/(4\alpha-2)} \ll m \ll n$.




Proof. Clearly the bias of the estimator is bounded by $\sum_{i>m} \beta_i^2 \lesssim m^{-(2\alpha-1)} = o(n^{-1/2})$,
and its variance is bounded by $\sum_{i \le m} (4\beta_i^2/n + 2/n^2) = 4\vartheta/n + o(n^{-1})$. □
However, this is a standard frequentist approach. There is a problem, and a 
solution is justified because it works, and not because it fits a paradigm. The 
solution works because we see that bias accumulation is the issue and we can 
deal with it. 
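
The effect of the correction can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not part of the text: the signal $\beta_i = i^{-\alpha}$ for $i \le m$ (and 0 beyond), the values $n = 2000$ and $\alpha = 0.8$, and the cut-off $m = n^{0.9}$, which was picked so that $n^{1/(4\alpha-2)} < m < n$. Observations are drawn as $X_i \sim N(\beta_i, 1/n)$, in line with the variance computation in the proof.

```python
import random

def plug_in_sum_squares(xs):
    """Naive estimate: each X_i^2 overshoots beta_i^2 by 1/n on average."""
    return sum(x * x for x in xs)

def debiased_sum_squares(xs, n):
    """Subtract the noise variance 1/n from every squared observation."""
    return sum(x * x - 1.0 / n for x in xs)

def mean_errors(n=2000, alpha=0.8, reps=50, seed=1):
    """Average error of both estimates over `reps` simulated data sets."""
    rng = random.Random(seed)
    m = int(n ** 0.9)  # hypothetical cut-off between n^{1/(4 alpha - 2)} and n
    beta = [i ** -alpha for i in range(1, m + 1)]  # hypothetical signal
    target = sum(b * b for b in beta)
    sd = n ** -0.5
    err_plug = err_debiased = 0.0
    for _ in range(reps):
        xs = [b + rng.gauss(0.0, sd) for b in beta]
        err_plug += plug_in_sum_squares(xs) - target
        err_debiased += debiased_sum_squares(xs, n) - target
    return err_plug / reps, err_debiased / reps

err_plug, err_debiased = mean_errors()
print(f"plug-in error: {err_plug:+.3f}   debiased error: {err_debiased:+.3f}")
```

With these hypothetical values the plug-in error is close to the accumulated bias $m/n$, while the debiased estimator is centered at the target.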



5.1. The Bayesian analysis 

Consider the above situation with $\alpha \in (3/4, 1)$. Then the estimator suggested in
Proposition 5.1 sums at least $n^{1/(4\alpha-2)}$ terms. Note that most of the terms, all
beyond the first $n^{1/2\alpha}$, are deeply under the noise level! This creates a problem
for Bayesian analysis.

For any prior on f3i, i > n^/'^°'+'^ with h' as small as needed and m = yT,i/(4"-2) 
as in Proposition 5.1: 



where 



'^"" ■^2 /v + \2| P , 



max log A, < max -\{X, - hY - {X, - ta)^! -^ 0, 

nl/2 + ^<i<„l/4o-2 nl/2 + .^<j<„l/(4o-2) 2 

since maxj<„ \Xi\n~^^''^~'^ — > 0. But this means that all the tail of /3,^ for 
$i > n^{1/2\alpha}$ is replaced essentially by its a priori mean. Since this tail may carry
signal of order $n^{-(2\alpha-1)/2\alpha} \gg n^{-1/2}$, the Bayes estimator is not consistent.

Where is the difference between the Bayes estimator and the frequentist
estimator of Proposition 5.1? Both try to be unbiased. For the frequentist this
means that whatever the value of the parameter is, the estimator has an
expectation which is very close to the estimated parameter. The Bayesian, however,
is unbiased with respect to his prior. Thus it is easy for him to replace whatever
is difficult to estimate by its expectation according to the prior. This makes his
estimator inconsistent in the frequentist sense, and inconsistent in any regular
sense if the estimator does not describe exactly the generating mechanism of
the data.



6. Data dependent sample size 

The stopping rule principle says roughly that Bayesian inference should not 
depend on any stopping rule used to obtain the data to be analyzed, as long as 
it was done using stopping times. Formally, Berger and Wolpert (1988) wrote: 



"Stopping Rule Principle (SRP): In a sequential experiment $E^{\tau}$, with observed
final data $x^n$, $\mathrm{Ev}(E^{\tau}, x^n)$ should not depend on the
stopping rule $\tau$."

We challenge how this principle works with high dimensional data. We con- 
sider another version of the white noise model. We consider a finite version of 
it, with n-2" < /3, < Sn^^a^ i :^ l^,,,^k = [n^^J, and 1/6 < a < 1/4. The 
$i$th component is a Brownian motion with drift $\beta_i$; $X_i(\cdot)$ is observed until time
$T_i$. Let $\bar X_i(t) = X_i(t)/t$, the sufficient statistic for $\beta_i$ given $\{Y_i(s) : s < t\}$.
Of course, $\bar X_i$ is also the MLE. Finally, let $\pi_i$ be the prior distribution of $\beta_i$
given $X_{-i}$, the set of all observed components other than $X_i$. Let $f_i(\cdot)$ be the
distribution of $\bar X_i(T_i)$ given $X_{-i}$ (i.e., $f_i = \pi_i * N(0, 1/T_i)$). We assume that
the prior is nonparametric in the sense that $\pi_i$ is bounded away from 0 on the
permitted support, thus the rest of the data does not reveal too much about $\beta_i$.

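It is well known that the posterior mean of $\beta_i$ satisfies

$$E(\beta_i \mid \text{data}) = \bar X_i(T_i) + \frac{1}{T_i}\,\frac{f_i'\bigl(\bar X_i(T_i)\bigr)}{f_i\bigl(\bar X_i(T_i)\bigr)}.$$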

If $T_i = O_p(n)$, then $f_i \approx \pi_i$. Further, $\bar X_i(T_i) \approx \beta_i$. Suppose $T_i$ is correlated
with $f_i'/f_i(\beta_i)$, then the MLE of $\sum_{i=1}^{k} \beta_i$, $\sum_{i=1}^{k} \bar X_i(T_i)$, has a random error of
order $n^{\alpha} n^{-1/2}$, while the Bayes estimator has a bias which is $O_p(n^{2\alpha} n^{2\alpha}/n)$
(there are $n^{2\alpha}$ terms, each with $f_i'/f_i$ of size $n^{2\alpha}$, and a factor
of $1/n$ due to $T_i^{-1}$). In the range of $\alpha$ we consider, the Bayes bias dominates the
random error!
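
The posterior-mean identity used in this argument, namely that the posterior mean equals $x + T^{-1} f'(x)/f(x)$ when $x$ is the observed value of $\bar X_i(T)$ and $f$ is its marginal density, is easy to check numerically in the conjugate case. The sketch below is only an illustration under assumed values: a hypothetical normal prior $N(\mu, \tau^2)$ for $\beta_i$ (names `mu`, `tau2`), a fixed observation time `t`, and a numerical derivative of $\log f$. In that case the marginal is $N(\mu, \tau^2 + 1/t)$ and the identity should reproduce the usual conjugate posterior mean.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_mean_from_marginal(x, t, marginal_pdf, eps=1e-5):
    """x + (1/t) * f'(x)/f(x), with (log f)' taken by a central difference."""
    d_log_f = (math.log(marginal_pdf(x + eps))
               - math.log(marginal_pdf(x - eps))) / (2.0 * eps)
    return x + d_log_f / t

# Hypothetical values: prior N(mu, tau2), observation time t, observed mean x.
mu, tau2, t, x = 0.3, 0.5, 4.0, 1.1

def marginal(y):
    """Marginal density f = prior * N(0, 1/t), i.e. N(mu, tau2 + 1/t)."""
    return normal_pdf(y, mu, tau2 + 1.0 / t)

via_marginal = posterior_mean_from_marginal(x, t, marginal)
closed_form = (tau2 * x + mu / t) / (tau2 + 1.0 / t)  # conjugate-normal posterior mean
print(via_marginal, closed_form)
```

The correction term is the score of the marginal density scaled by $1/T$, which is why a large $f_i'/f_i$ combined with a data-dependent $T_i$ can produce a systematic bias.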

Consider now the stopping time: 

$$T_i = \inf\{t : X_i(t) = n\beta_0 A_i + Z_i\sqrt{t}\},$$

where $Z_i$ is a $N(0, 1)$ variable, $A_i$ is an independent variable, whose values
are under the control of an adversary, who is ready to tell their values to the
statistician, but if the latter is a Bayesian, he simply ignores the former. The
adversary is only restricted to have E{Ti\Ai) = ilp{n), which is the case if 
A, = Op(l). 

All agree that $(T_i, X_i(T_i))$ are sufficient. In fact, the situation is more extreme.
The distribution of $X_i(T_i) \mid T_i$ is independent of $\beta_i$, and hence either $T_i$ or $X_i(T_i)$
is sufficient by itself!

This is a trivial statement for the frequentist who knows $A_i$. The Bayesian
cannot distinguish between this stopping time and the situation in which $T_i$
is endogenous, and the distribution of $\beta_i$ given $T_i = t$ is the distribution of
$n\beta_0 A_i/t$ (e.g., a point mass). Alternatively, his estimator is biased whenever $A_i$
is empirically correlated with $\beta_i$.

6.1. Example

We consider again the same vector $\beta$ represented in Figure 1(a). But this time
the spine image of 1(d) is giving the sample size per component. Noise was
added to obtain Figure 2(a). The SNR now, as can be seen, is much higher than
before (+2.72 dB). As a result the Bayes estimator given in Figure 2(b) is much
smoother.

The prior was again one with independent normal components, with mean equal
to the mean value of $\beta$ and variance equal to its true empirical variance. Each pixel
in the image was observed until a stopping time which was proportional to the
gray level of the corresponding pixel in the spine image, Figure 1(d). Thus we
have two processes on the unit square. One represents $\{\beta_i\}$, where each pixel in
the jet image represents the value of $\beta_i$. The other process is that of the stopping
times, and is given by the spine image. In both cases, the image is an image of an object,
and therefore has a center and margins. There is a strong correlation between
the points of the picture, beyond mere continuity. The vector of parameters and
the stopping times are correlated, being referred to images that follow the same
rules of a good image, rules that are not necessarily known to the data analyst. In
500 Monte Carlo simulations the RMSE of the mean Bayes estimate was 0.05
compared to the mean of the MLE RMSE which was 0.009. The difference was
almost all because of the bias. If we replace the stopping time with a fixed time,
the average of the above, then the Bayes estimator is slightly better (RMSE
of 0.0071 versus 0.0072). Thus, the example justifies the claim that the Bayes
estimator failed when the stopping rule and the parameter values happened to
be serially correlated.

7. A positive result and a summary 

We start with a version of Doob's consistency result, which shows that
the existence of a uniformly $\sqrt{n}$-consistent estimator ensures that the posterior
distribution is $\sqrt{n}$-consistent with prior probability 1.

To simplify notation we consider in this section the Markov chain $\eta_0 \to X_n \to \eta_n$,
where $\eta_0, \eta_n \in \mathcal{H}$, $\eta_0 \sim \pi$, $X_n \sim P_{\eta_0}$, and given $X_n$, $\eta_0$ and $\eta_n$ are i.i.d., i.e.,
given $X_n$, $\eta_n$ is distributed according to the a posteriori distribution $\pi_{X_n}$. Let $P$
be the joint distribution of the chain. In the following $d_n$ is a semi-metric on the
parameter space, normalized to the sample size. Typically, in the nonparametric
situation considered in this paper, $d_n(\eta, \eta') = \sqrt{n}\,|\vartheta(\eta) - \vartheta(\eta')|$ for some real
functional $\vartheta$ of the parameter.

We consider an estimator $\hat\eta_n$ to be $d_n$ consistent uniformly on $\mathcal{H}$, if for
all $\varepsilon > 0$ there is $M < \infty$ such that for all $\eta \in \mathcal{H}$ and $n$ large enough,
$P_{\eta}(d_n(\hat\eta_n, \eta) \ge M) \le \varepsilon$. The posterior is $d_n$ consistent uniformly on $\mathcal{H}$ if for all
$\varepsilon > 0$ and $\delta > 0$, there is $M < \infty$ such that for all $\eta_0 \in \mathcal{H}$ and $n$ large enough,

$$P_{\eta_0}\bigl(\pi_{X_n}(d_n(\eta_n, \eta_0) > M) > \varepsilon\bigr) < \delta.$$

Theorem 7.1 Suppose there exists an estimator that is $d_n$ consistent uniformly on $\mathcal{H}$.
Then there is an $\mathcal{H}' \subset \mathcal{H}$ such that $\pi(\mathcal{H}') = 1$ and the posterior is $d_n$ consistent
on $\mathcal{H}'$.

The proof is given in Appendix C. 




Thus, the existence of a uniformly good frequentist estimator ensures that the
Bayes posterior concentrates at the right rate under all parameter values that
seemed relevant under the prior. This claim does not contradict our findings. In
the CODA and PLM examples, the difference between the Bayes estimator and
the frequentist one is that the former ignores the information that restricts the
model to a subset of prior probability 0. In the quadratic functional of the white
noise model, the demand that the prior be "honestly non-parametric" limited
$\beta_1, \beta_2, \ldots$ to regular sequences obeying LLNs, and hence all non-ergodic
sequences are in a set with probability 0. Finally, in the linear functional
example, each prior fails for each linear functional on a set of parameters with
probability 0, but if the linear functional and the parameter are chosen together,
as we argue may happen, the theorem has no consequences.

In this paper we presented a few examples in which a nonparametric prior fails
to estimate simple parametric functions at rate $n^{-1/2}$ even though frequentist
efficient procedures exist. In these examples the assumed smoothness was
minimal, but we do not believe that this is essential. With minimal smoothness it
was easy to prove that the error explodes. With smoother objects it would be
more difficult to prove, and the estimators would be just not optimal.

Bayes procedures are always unbiased with respect to the prior they are
based upon. The Bayes estimator tends to replace elements buried inside the
noise with their a priori mean. This would be a reasonable strategy if the prior
represents a physical reality. It is less so if the prior represents a subjective belief, not to say
a subjective belief based upon the need to have a prior that can be handled
easily: the strategy may then fail for highly plausible values of the parameters.

What were the phenomena that were exemplified in our models? 

1. Spurious correlation. A possible empirical cross-correlation between two 
independent processes. The Bayes estimator ignores it. This happened in 
the CODA example of Section 2, the partial linear model of Section 3, 
the linear estimator of Section 4, and the stopping time story in Section 
6. Since this correlation has expectation 0, the Bayes estimator is on the 
average unbiased, but this is being unbiased only with respect to the sub- 
jective probability. It is biased in any other sense. 

2. The Bayesian is required by his paradigm to plug-in the same estimator in 
estimating all functionals of a non-Euclidean parameter under quadratic 
loss function. The non-existence of a PIE, a universal estimator that can be 
plugged-in, makes the Bayesian paradigm too inflexible. This was shown 
in Section 4. 

3. The fact that the Bayes estimator assumes that elements that a priori
have mean 0 can be considered without harming the final results played
a role in the failure of the Bayes estimator in the partial linear model of
Section 3.

4. On the other side of the previous point, replacing signal buried deeply
in the noise with 0 may bias the estimator when the components of the
signal can be estimated without bias and accumulated without a bias and
with bounded overall variance. See the example of Section 5.




5. Conceptually, the strict likelihood principle (cf. Berger and Wolpert (1988))
causes the Bayesian to ignore auxiliary information that may be used to
unbias the estimator. This was at the center of the argument of Section 2,
the CODA example (the missing probability), Section 3, the partial linear
model (the information about the roughness), and the stopping procedure
in the stopping time example of Section 6.

Real life examples are more complex and less traceable than the toy problems
we played with in this paper. As a result, it would be harder to understand
the subtle implications of the assumptions hidden in the prior. It is
very hard to build a really complicated prior. The typical researcher would use
a prior in which there are a lot of independent components. However, with many
independent or semi-independent components, laws like the LLN and CLT take effect,
and thus what was supposed to be a vague prior is concentrated in a small corner
of the parameter space. This invariably would impinge on different estimation
problems.

Appendix A: An auxiliary result: independent but correlated series 

Much of the analysis in this paper is based on presenting counter examples of
parameter values on which a given procedure fails. This is satisfactory from a
minimax frequentist point of view: one example is enough to argue that the result
depends on the unknown parameter and is not uniformly valid, or asymptotic
minimax. However, this may not convince the Bayesian, who may claim that
the counter example is a priori unreasonable. A typical example of the argument
was presented in the CODA example of Section 2. It can be characterized by
having two a priori independent processes (p and g in the example), which
happened to be "similar". One might think that for the Bayesian this is a very
unlikely event. After all, he assumes that they are independent: one depends on
the biology and the other on the budget constraints. In this section we argue
that actually this can be a likely event. Thus, Harmeling and Toussaint (2007)
write: "Let us now get to the core of Robins and Ritov [1997]. The authors
consider uniform unbiasedness of an estimator. This means that the estimator
has to be unbiased for every possible choice of $\theta$ and $\xi$. In the experiment we
performed above, though, we chose $\theta$ and $\xi$ independently and thus it was very
unlikely that we ended up with an accidentally correlated $\theta$ and $\xi$, e.g., where $\theta$
tends to be large whenever also $\xi$ is (or inversely)." We claim that this criticism
ignores the fact that two processes can be independent while, most likely,
having high (cross-)correlation. This would be the case if they are not mixing, and
have similar autocorrelation functions. We elaborate on this in this appendix.

Suppose $U_1, \ldots, U_n$ and $V_1, \ldots, V_n$ are two independent simple random walks.
Then of course $U_n$ and $V_n$ are uncorrelated. But we may consider the correlation
between these two series, $R = n^{-1} \sum_{i=1}^{n} (U_i - \bar U_n)(V_i - \bar V_n)$, where $\bar U_n$ and $\bar V_n$
are the empirical means of the two series respectively. $R$ is a random variable,
and clearly it has mean 0. However, it is far from being close to 0 even if $n$ is
large. In fact asymptotically it is distributed almost uniformly on most of the
interval $(-1, 1)$. McShane and Wyner (2011) use this argument to argue against
some standard analysis of historical data in climate research. The reason for
this somewhat surprising fact is that random walks and Brownian motions are
less wild than they are sometimes pictured. In fact given $U_n$, the best guess of
$U_{[n/2]}$ is $U_n/2$, and the sequence tends to be, very roughly speaking, monotone.
But if both $U_1, \ldots, U_n$ and $V_1, \ldots, V_n$ are "somewhat" monotone, then they
are serially correlated, maybe positively correlated, maybe negatively so, but
usually not uncorrelated.
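
The phenomenon is easy to reproduce. The following is a small simulation sketch with hypothetical settings (series length 200, 300 replications, and a threshold of 0.3, none of which come from the text). It uses the sample correlation coefficient, that is, the cross-covariance normalized by the two empirical standard deviations, and compares pairs of independent simple random walks with pairs of independent i.i.d. $\pm 1$ sequences.

```python
import random

def sample_correlation(u, v):
    """Empirical correlation coefficient of two equally long series."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sum((a - mean_u) ** 2 for a in u) ** 0.5
    sd_v = sum((b - mean_v) ** 2 for b in v) ** 0.5
    return cov / (sd_u * sd_v)

def random_walk(n, rng):
    """Simple symmetric random walk with +/-1 steps."""
    position, path = 0, []
    for _ in range(n):
        position += rng.choice((-1, 1))
        path.append(position)
    return path

def iid_signs(n, rng):
    """Independent +/-1 variables (a mixing sequence)."""
    return [rng.choice((-1, 1)) for _ in range(n)]

def share_large(make_series, n=200, reps=300, threshold=0.3, seed=0):
    """Fraction of independent pairs whose |correlation| exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        r = sample_correlation(make_series(n, rng), make_series(n, rng))
        if abs(r) > threshold:
            hits += 1
    return hits / reps

share_walks = share_large(random_walk)
share_iid = share_large(iid_signs)
print(f"|r| > 0.3: random walks {share_walks:.2f}, i.i.d. signs {share_iid:.2f}")
```

For the i.i.d. sequences a large empirical correlation is a rare event; for independent random walks it is the typical outcome.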

Consider now two general independent mean 0 random non-mixing sequences
$U_1, \ldots, U_n$ and $V_1, \ldots, V_n$. Suppose that the two sequences have some
autocorrelation functions $A(i,j)$ and $B(i,j)$. We do not assume that the series are
stationary, and we do not know their autocorrelation functions. The picture we
have in mind is that each $(U_i, V_i)$ is a characteristic of points in a large graph,
and neighbor nodes are highly correlated, but we do not know the neighborhood
structure of the graph. Let

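$$R = \operatorname{x-cov}(U_\cdot, V_\cdot) = \frac{1}{n}\sum_{i=1}^{n} U_i V_i - \frac{1}{n^2}\sum_{i=1}^{n} U_i \sum_{i=1}^{n} V_i,$$

where x-cov stands for Cross COVariance. Then $ER = 0$, while direct calculations give:

$$\begin{aligned}
\operatorname{Var}(R) &= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} A(i,j)B(i,j) - \frac{2}{n^3}\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{n} A(i,j)\Bigr)\Bigl(\sum_{j'=1}^{n} B(i,j')\Bigr) + \frac{1}{n^4}\sum_{i,j} A(i,j)\sum_{i,j} B(i,j) \\
&= \frac{1}{n}\sum_{i=1}^{n} \operatorname{x-cov}\bigl(A(i,\cdot), B(i,\cdot)\bigr) - \operatorname{x-cov}\Bigl(\frac{1}{n}\sum_{j=1}^{n} A(\cdot,j), \frac{1}{n}\sum_{j=1}^{n} B(\cdot,j)\Bigr).
\end{aligned}$$

To get an impression suppose that $n^{-1}\sum_{j=1}^{n} A(i,j) = n^{-1}\sum_{j=1}^{n} B(i,j) \equiv c$. Then

$$\operatorname{Var}(R) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \bigl(A(i,j) - c\bigr)\bigl(B(i,j) - c\bigr).$$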



Clearly if the two series are mixing, and $\sum_j A(i,j) = \sum_j B(i,j) = O(1)$ then
$\operatorname{Var}(R) = O(n^{-1})$. However, if they are not mixing, and have similar
autocorrelation functions, then most realizations of these two series would be serially
correlated.
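
The contrast between the two regimes can be made concrete by evaluating the variance exactly. The sketch below assumes that $A(i,j) = E\,U_iU_j$ and $B(i,j) = E\,V_iV_j$ (autocovariances of independent mean 0 series) and uses the expansion $\operatorname{Var}(R) = n^{-2}\sum_{i,j} A(i,j)B(i,j) - 2n^{-1}\sum_i \bar A_i \bar B_i + \bar A\,\bar B$, where $\bar A_i$ is the mean of row $i$ of $A$ and $\bar A$ is the mean of all its entries. The two covariance structures, i.i.d. with unit variance and a simple random walk with $A(i,j) = \min(i,j)$, and the division by the product of the average variances, are illustrative choices rather than part of the argument above.

```python
def var_cross_cov(A, B):
    """Exact Var(R) for R = mean(U_i V_i) - mean(U) mean(V), where U and V are
    independent mean-zero series with E U_i U_j = A[i][j], E V_i V_j = B[i][j]."""
    n = len(A)
    row_a = [sum(row) / n for row in A]
    row_b = [sum(row) / n for row in B]
    term1 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n ** 2
    term2 = 2.0 * sum(a * b for a, b in zip(row_a, row_b)) / n
    term3 = (sum(row_a) / n) * (sum(row_b) / n)
    return term1 - term2 + term3

def iid_cov(n):
    """Autocovariance of an i.i.d. unit-variance sequence."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def walk_cov(n):
    """Autocovariance of a simple random walk: E U_i U_j = min(i, j)."""
    return [[float(min(i, j) + 1) for j in range(n)] for i in range(n)]

def normalized_var(cov):
    """Var(R) divided by the squared average variance, for two series sharing
    the same autocovariance; a rough 'squared correlation' scale."""
    n = len(cov)
    avg_var = sum(cov[i][i] for i in range(n)) / n
    return var_cross_cov(cov, cov) / avg_var ** 2

for n in (50, 200):
    print(n, normalized_var(iid_cov(n)), normalized_var(walk_cov(n)))
```

For the i.i.d. case the value is $(n-1)/n^2$ and vanishes as $n$ grows, whereas for the random walk the normalized variance stays bounded away from 0.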



Appendix B: Bayesian estimator for the CODA model when the 
weight function is known 



Let 



$$G_j = \bigl\{ i : g_i = g(x_i) \in \bigl(j m n^{-1/2}, (j+1) m n^{-1/2}\bigr) \bigr\}$$

for some constant $m$. The prior is defined so that $h_i$ is constant on $G_j$, sampled
from a non-informative prior, such that the values $h_j$ on the different sets are
independent. In this case:

n^'^E i^h^ I data j =n^^E ^hj\Gj\ data 

But the fraction on the RHS is the proportion of ^^ Ki put randomly in cither 
{i : Fi = f , 2 e Gj} or {« : Yi = 0, J e Gj} which falls in the first of these sets. 
If $g_i$ was really constant over $G_j$, this would correspond to a hypergeometric
distribution. Since $g$ is not really constant over $G_j$, we should add an error term
and get






E^-|G.-l|=%rl^^-l + ^ = E^' + ^- 



|G,i 



where $|\Delta_j| < m|G_j| n^{-1/2}$. Since the cells are independent, the Bayesian
estimator is $\sqrt{n}$-consistent.

Appendix C: Proofs 

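Proof of Proposition 4.1. Clearly

$$\begin{aligned}
\sum_{i=1}^{\infty} E(\hat\beta_i - \beta_i)^2 &= \sum_{i \le n^{1/2\alpha}} n^{-1} + \sum_{i > n^{1/2\alpha}} \beta_i^2 \\
&\le n^{-(2\alpha-1)/2\alpha} + \sum_{i > n^{1/2\alpha}} i^{-2\alpha} \\
&\le \frac{2\alpha}{2\alpha - 1}\, n^{-(2\alpha-1)/2\alpha}.
\end{aligned}$$

That this is the minimax risk is established by considering the Bayes prior which
makes $\beta_1, \beta_2, \ldots$ independent, $P(\beta_i = \pm i^{-\alpha}) = 1/2$. □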

Proof of Lemma 4-2. First note that because of the monotone likelihood ratio 
property, 'd{x) is a monotone increasing function of x. 

l + bt = -^KgE^{e\X) 

_d_ /te~(^-')'/2"'d^ 




where E^ is the expectation assuming the true value of the parameter is d, and 
{@, X) is a pair of random variables such that ~ tt, and X|8 ^ ^(0, cr^), and 
E^ is the expectation under their joint distribution. Note that E^ is a formal 
expression, since we assume that X ^ 7V(i?,cr^). Let Z ~ N{0,a'^) then 

^ + * " di)^ /e-(^+''-*)V2^^d^(t) 

1 f/t(t-Z-?9)e-(^+''-*)'/2-'d7r(t) 

/te-(^+'^-*)'/^"'rf7r(t) /(t - Z - ^)e-(^+^-*)V2"'d^ft) -( 

= ^E^{Var(e I X)}, 

Hence < 1 + fe^ < (a/a)'^, or 6^ e [-1, -(1 - (a/a)^)]. The lemma follows the 
mean value theorem. 

D 

Proof of Theorem 7.1. The proof is based on the two lemmas that follow. Suppose
the posterior is not $d_n$ consistent on $\mathcal{H}'$ with $\pi(\mathcal{H}') > 0$. Then by Lemma
C.1, (2) must hold for $\eta_0 \in \mathcal{H}'$. By Lemma C.2, (4) must hold. But (4) contradicts
that tt{H) ~ 1, since then from all M: 7r{?7 : Prji-Jr] — fjngeM)} > 0. 

D 

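Lemma C.1 Suppose

1. There is a statistic $\hat\eta_n$ such that $\limsup_n P_{\eta_0}(d_n(\hat\eta_n, \eta_0) > M) \to 0$ as
$M \to \infty$.

2. For all $M < \infty$: $\limsup_n P_{\eta_0}\bigl(\pi_{X_n}(d_n(\eta_n, \eta_0) > 2M) > 2\varepsilon\bigr) > 2\delta$.

Then there is $M$ which may depend on $\eta_0$ such that

$$\limsup_n P_{\eta_0}\bigl(\pi_{X_n}(d_n(\eta_n, \hat\eta_n) > M) > \varepsilon\bigr) > \delta. \qquad (2)$$

Proof.

$$\begin{aligned}
P_{\eta_0}\bigl(\pi_{X_n}(d_n(\eta_n, \hat\eta_n) > M) > \varepsilon\bigr)
&\ge P_{\eta_0}\bigl(\{\pi_{X_n}(d_n(\eta_n, \eta_0) > 2M) > 2\varepsilon\} \cap \{d_n(\hat\eta_n, \eta_0) \le M\}\bigr) \\
&\ge P_{\eta_0}\bigl(\pi_{X_n}(d_n(\eta_n, \eta_0) > 2M) > 2\varepsilon\bigr) - P_{\eta_0}\bigl(d_n(\hat\eta_n, \eta_0) > M\bigr).
\end{aligned}$$

By assumption the lim-sup of the first term on the RHS is bounded below by $2\delta$, while
we can choose $M$ large enough such that the second term is bounded by $\delta$ for
all large enough $n$. The lemma follows. □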

Lemma C.2 Suppose there exist a statistic $\hat\eta_n$ and $M, \varepsilon, \delta > 0$ such that

$$P_{\eta_0}\bigl(\pi_{X_n}(d_n(\hat\eta_n, \eta_n) > M) > \varepsilon\bigr) > \delta \qquad (3)$$

for all $\eta_0 \in \mathcal{H}'$ and $\pi(\mathcal{H}') > \gamma > 0$. Then for all $M < \infty$:

$$P\bigl(d_n(\hat\eta_n, \eta_0) > M\bigr) > \varepsilon\delta\gamma. \qquad (4)$$

Proof. Obviously, if $U, V, W$ are three random variables, then $E(E(E(U \mid V) \mid W)) = E(U)$.
Hence taking the expectation of (3), we obtain (4). □

References 

Bayarri, M. and Berger, J. (2004). The interplay between Bayesian and frequentist analysis. Statistical Science, 19, 58-80.

Berger, J. (2006a). The case for objective Bayesian analysis. Bayesian Analysis, 1, 385-402.

Berger, J. (2006b). Rejoinder. Bayesian Analysis, 1, 457-464.

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). Springer-Verlag, New York.

Berger, J. O. and Wolpert, R. L. (1988). The Likelihood Principle: A Review, Generalizations, and Statistical Implications (2nd ed.), volume 6 of Lecture Notes-Monograph Series. IMS, Hayward, California.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1998). Efficient and adaptive estimation in semiparametric models. Springer-Verlag, New York.

Bickel, P. J. and Kleijn, B. J. (2012). The semiparametric Bernstein-von Mises theorem. Ann. Statist., 2012, To appear.

Bickel, P. J. and Ritov, Y. (1989). Estimation of squared integrated density derivatives. Sankhya, A50, 381-393.

Bickel, P. J. and Ritov, Y. (2003). Nonparametric estimators which can be "plugged-in". Ann. Statist., 31(4), 1033-1053.

Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley, New York.

Cox, D. (1993). An analysis of Bayesian inference for nonparametric regression. Ann. Statist., 21, 903-924.

Diaconis, P. and Freedman, D. (1998). Consistency of Bayes estimates for nonparametric regression: Normal theory. Bernoulli, 4, 411-444.

Engle, R. F., Granger, C. W. J., Rice, J., and Weiss, A. (1986). Nonparametric estimates of the relation between weather and electricity sales. J. Amer. Statist. Assoc., 81, 310-320.

Freedman, D. (1993). On the asymptotic behavior of Bayes estimates in the discrete case I. Ann. Math. Statist., 34, 1386-1403.

Freedman, D. (1999). On the Bernstein-von Mises theorem with infinite dimensional parameters. Ann. Statist., 27, 1119-1140.

Ghosal, S., Ghosh, J., and van der Vaart, A. (2000). Convergence rates of posterior distributions. Ann. Statist., 28, 500-531.

Goldstein, M. (2006). Subjective Bayesian analysis: Principles and practice. Bayesian Analysis, 1, 403-420.

Harmeling, S. and Toussaint, M. (2007). Bayesian estimators for Robins-Ritov's problem. Technical report, University of Edinburgh, School of Informatics Research Report EDI-INF-RR-1189.

Kleijn, B. and van der Vaart, A. (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist., 34, 837-877.

Le Cam, L. and Yang, G. (1990). Asymptotics in Statistics: Some Basic Concepts. Springer, New York.

Lehmann, E. and Casella, G. (1998). Theory of Point Estimation. Springer, New York.

McShane, B. B. and Wyner, A. J. (2011). A statistical analysis of multiple temperature proxies: Are reconstructions of surface temperatures over the last 1000 years reliable? Ann. Appl. Stat., 5, 5-44.

Robins, J. M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statistics in Medicine, 17, 285-319.

Schick, A. (1986). On efficient estimation in regression models. Ann. Statist., 14, 1486-1521.

Wasserman, L. (1998). Asymptotic properties of nonparametric Bayesian procedures. In Lecture Notes in Statistics, vol. 133: Practical Nonparametric and Semiparametric Bayesian Statistics, editors: D. Dey, P. Müller and D. Sinha (pp. 293-304). Springer, New York.



Fig 1: Computing linear functionals. (a) The vector $\beta$; (b) The vector $\beta + \varepsilon$ (panel title: X (MLE));
(c) The Bayes estimator; (d) The functional $h$.

Fig 2: (a) $\beta + \varepsilon$ (panel title: X (MLE)); (b) The Bayes estimator.



